Abstract

This paper proposes a novel approach to extracting meaningful content information
from video through the collaborative integration of image understanding and natural
language processing. As a concrete example, we developed Name-It, a system that
associates faces and names in videos. Given news videos as a knowledge source,
Name-It automatically extracts face-name associations as content information. The
system can infer the name of a given unknown face image, or find faces that are
likely to bear a given name. This paper explains the method and presents several
successful matching results, which demonstrate both the effectiveness of integrating
heterogeneous techniques and the importance of extracting real content information
from video, especially face-name associations.
Given:

Video 


Transcript 

BUT  ASK  BILL  MILLER  TO  LABEL 
HIS  MUSIC  AND  — 

>> RIGHT NOW I’D DESCRIBE
MY MUSIC AS VISIONARY ROCK.

>> Reporter: THAT VISION LOOKS
BACK  AS  MUCH  AS  FORWARD, 

TO  MILLER’S  ROOTS  ON  A  WISCONSIN 
INDIAN  RESERVATION. 


Questions: 


What  is  the  name  of 


THE  NATIONAL  LEAGUE  ROOKIE 
OF  THE  YEAR  AWARD  WENT 
TO  A  DODGER,  PITCHER  HIDEO  NOMO. 
THE  JAPANESE-BORN  PITCHER  POSTED 
A  13-6  RECORD  AND  LED  THE  LEAGUE 
WITH  236  STRIKEOUTS. 

NOMO  RECEIVED  18  OF  THE  POSSIBLE 
28  FIRST-PLACE  VOTES. 


ONE  WAY  OR  THE  OTHER 
NEWT  GINGRICH  IS  IN 
THE  PRESIDENTAL  RACE. 

>> THE SPEAKER, EVEN THOUGH
HIS  NAME  WASN’T  ON  THE  BALLOT. 
THE  SPEAKER  WAS  AN  ISSUE 
ON  TUESDAY  AND  THE  SPEAKER  WILL 
BE  AN  ISSUE  IN  EVERY  RACE  NEXT 
YEAR. 

>> Reporter: AND JUDGING
FROM OTHER NUMBERS
IN THE POLL -- GINGRICH --

AT  LEAST  WHAT  THE  PUBLIC  THINKS 
HE  STANDS  FOR  —  IS  A  FAIRLY  GOOD 
ISSUE  FOR  DEMOCRATS. 


Who  is  NEWT  GINGRICH  ? 


Figure  1:  “Who”  information 


1  Introduction 

As digital video libraries become realistic, supported by technological innovations
such as MPEG video compression, large and fast disk arrays, and high-speed
local/wide area networks, content-based video indexing is becoming much more
important. Many research efforts pursue this goal, including image retrieval based
on image features such as color histograms and texture analysis [1], feature
extraction using the Karhunen-Loeve (K-L) transform [2], and video structuring
based on scene change detection [3]. These approaches are successful to some
extent: with image retrieval based on color histograms, for example, a forest image
may match well with images containing trees, and a seaside image may match well
with marine images. This is because the mapping performed by color histogram
conversion yields adjacency relations similar to a person’s cognitive adjacency
relations between images. However, these approaches cannot be called real
content-based indexing, since they do not extract real content information.

We aim at extracting real content information from news videos. News videos
convey important content information, e.g., “President ... went to ... to attend
... meeting,” “Prime Minister ... said ... at that meeting,” “Senate leader ...
talked about ...,” etc. Looking at these types of content, it can be said that “who”
information, i.e., the face and name association, is one of the most important
pieces of information that can be acquired from news videos (Figure 1). A person
can easily say who is shown in a video; e.g., a person may identify Clinton,
Gingrich, and Dole by watching and hearing only 5 minutes of news showing them,
even though the person did not know their faces beforehand. However, this task
contains several difficult steps:

•  Scene/topics  detection  in  videos, 

•  Face  detection  in  videos, 

•  Identification of faces that occur in different scenes/videos,

•  Natural  language  processing  to  detect  names  in  narrations,  and 

•  Following  context  to  select  a  person  corresponding  to  a  certain  name  who 
occurred  in  video. 

These procedures require the cooperative integration of video structuring, machine
vision, and natural language processing. As a first step toward delineating real
content information, we introduce a face and name association system, Name-It.

Input videos are composed of transcripts as text information and image sequences.
Transcripts may be extracted from speech in the sound track of videos, or from
closed captions; we primarily use closed captions as transcripts. The extraction of
face images from image sequences and of name candidates from transcripts is
explained in Section 4. A matching measurement between each face image and name
candidate is then introduced, followed by the actual face-to-name and name-to-face
association algorithm. Finally, experimental results are shown to evaluate the
method.


2  Related  Work 

Face detection/identification has been researched for a long time, and there are
image database systems that can perform face similarity matching, including
MIT Photobook [2] and the Virage system. These two systems use the
eigenvector-based method for face similarity matching [4]. It is noteworthy that
Photobook applied its method to more than 7,500 images of about 3,000 people with
successful results. These results reveal that eigenvector-based face similarity
matching works well to some extent.

The Piction system [5] identifies faces within captioned photos, typically from
newspapers. The system extracts faces from a photo and analyzes the caption to
obtain geometric constraints among the faces that appear in the photo, then labels
each face with a name. Insofar as it integrates image processing and text
processing, this approach is similar to ours. However, Piction depends heavily on
text information, in other words, on natural language processing, because the
system requires captions to contain geometric descriptions, e.g., “top row, from
left, are Michael, Brian, ...” In contrast, Piction uses image processing only for
face location, not for face identification.
3  Overview  of  Name-It

Figure 2 shows a typical composition of a news video. A news video consists of
image sequences, which may contain persons’ faces, and transcripts, which may
contain persons’ names. Given such news videos, an ideal face and name association
system takes a face as input and outputs the name of that face, or takes a name and
outputs the face of that name (Figure 3). A human can do this very easily, since
news videos are created to be understandable to humans; even when we do not know
the faces of the persons in a video, we can identify each person. Although humans
seem to do this easily, the task includes several complicated, high-level
processes, as follows:

Scene/topic detection: A given video is perceived as many scenes, e.g., a scene in
which a newscaster introduces a topic, a scene of live video of the topic, etc.
Those scenes are then organized into a topic, e.g., the visit of President Clinton
to Ireland.

Face detection: Face regions within videos are immediately recognized. In
addition, each face is tracked within an image sequence.

Face identification: If an unknown face is given, and another face identical to it
is known to be “foo,” then the given face can be thought to be “foo.” This requires
face identification. In another case, if one piece of footage suggests that face-A
may be “bar,” another suggests that face-B is probably also “bar,” and face-A and
face-B are identical, then the likelihood that both are “bar” increases.

Name extraction: Names, or proper nouns, are discriminated from transcripts using
a priori knowledge or context tracking that tells which words are most likely
proper nouns. Context information may help in distinguishing names that are likely
to represent “the person of interest of that topic.”

A human would typically carry out face and name association within news video
footage (Figure 2) as follows:

1.  In the first scene, a newscaster appears and talks about Mr. Clinton. This
shows that the topic is likely to be about Mr. Clinton, and that his face will
appear in the subsequent scenes.

2.  The next scene mainly shows a certain person while the caster is still talking
about Mr. Clinton; thus the person may be Mr. Clinton.

3.  The next scene shows only one person talking, so this person is likely to be
the person of interest of this topic. It is also recognized that this person is
identical to the previous person; therefore this person, who might be the person
of interest, is probably Mr. Clinton.
MR. CLINTON VISITED NORTHERN
IRELAND AND...

Reporter: MR. CLINTON LIGHTED
THE CHRISTMAS TREE, ...

I PLEDGE YOU AMERICA'S SUPPORT...

Reporter: PRESIDENT CLINTON
PLEDGED THE HELP OF U.S.
INVESTIMENT...


Figure 2: Typical Composition of News Video


Transcript 

BUT  ASK  BILL  MILLER  TO  LABEL 
HIS  MUSIC  AND  — 

>>  RIGHT  NOW  I'D  DESCRIBE 
MY  MUSIC  AS  VISIONARY  ROCK. 

>>  Reporter:  THAT  VISION  LOOKS 
BACK  AS  MUCH  AS  FORWARD, 

TO  MILLER'S  ROOTS  ON  A  WISCONSIN 
INDIAN  RESERVATION. 


ONE  WAY  OR  THE  OTHER 
NEWT  GINGRICH  IS  IN 
THE  PRESIDENTAL  RACE. 

>>  Reporter:  AND  JUDGING 

FROM  OTHER  NUMBERS 

IN  THE  POLL  --  GINGRICH  — 

AT  LEAST  WHAT  THE  PUBLIC  THINKS 
HE  STANDS  FOR  —  IS  A  FAIRLY  GOOD 
ISSUE  FOR  DEMOCRATS. 


What is the name of (this face)?


Who is NEWT GINGRICH?


Figure  3:  Face  and  Name  Association 
Figure 4: Faces Extraction from Video

Although humans can do this easily, it is very hard for computers to achieve
automatically. For example, face extraction and identification is a critical vision
problem that has not yet been completely resolved, name extraction involves
difficult natural language processing problems, and scene/topic detection may
involve both vision and natural language processing problems. Several techniques
address parts of these goals, including scene change detection and face detection.
They are still far from complete, but properly integrating them may be a promising
way to obtain useful results. We took this strategy to bring face and name
association to fruition. We used intensity-histogram-difference-based scene change
detection [6], neural-network-based face detection, eigenvector-based face
identification, and dictionary-based name extraction, even though each of these is
still imperfect. We then integrate these techniques so that the image sequences and
transcript information of video footage are used collaboratively.


4  Preparations 

4.1  Face  Extraction 

Figure 5: Face normalization based on eye location

First we have to extract faces from the given video footage. We use a
neural-network-based face detector to locate faces in images [7]. This detector can
locate frontal faces of various sizes with some tolerance to rotation, but it takes
about 10 seconds on a workstation (MIPS R4400 200MHz) to process a 352 x 240 image.
We therefore first apply an intensity-histogram-based scene change detector [6] to
the video to obtain scene change images, then apply the face detector only to those
images. The overall face extraction process is shown in Figure 4. For example, we
processed 4.5 hours of news video, comprising about 490,000 frames (30 fps), and
obtained 4,318 scene changes. Applying the face detector to those scene change
images yielded 320 faces. To preserve high-quality face images, we use only large
faces (detected at more than 36 by 36 pixels) for which both eyes are successfully
detected by the face detector. In this example, there were no false detections
among the detected faces.
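
The idea of restricting the slow face detector to scene change images can be sketched as follows. This is only a minimal illustration, not the detector of [6]: the bin count, the normalized L1 histogram distance, the threshold value, and the function names are all assumptions made for the example, and frames are represented as plain nested lists of 8-bit intensities.

```python
def intensity_histogram(frame, bins=16, max_value=256):
    """Count pixels per intensity bin; `frame` is a list of rows of ints."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value * bins // max_value] += 1
    return hist


def detect_scene_changes(frames, threshold=0.5, bins=16):
    """Return indices of frames whose histogram differs strongly from the
    previous frame (hypothetical criterion: normalized L1 distance)."""
    changes = []
    previous = None
    for index, frame in enumerate(frames):
        hist = intensity_histogram(frame, bins)
        if previous is not None:
            pixels = sum(hist)
            # Ranges from 0 (identical histograms) to 1 (disjoint histograms).
            diff = sum(abs(a - b) for a, b in zip(hist, previous)) / (2.0 * pixels)
            if diff > threshold:
                changes.append(index)
        previous = hist
    return changes


dark = [[10, 20], [30, 40]]
bright = [[200, 210], [220, 230]]
scene_changes = detect_scene_changes([dark, dark, bright, bright])
```

In this toy sequence only the first bright frame is reported, so an expensive detector would run on one frame instead of four.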

Then we normalize faces using eye locations (Figure 5). Let the right and left eye
locations be $(r_x, r_y)$ and $(l_x, l_y)$, respectively. The eye distance $d$
between the right and left eyes is defined as

$$d = l_x - r_x. \tag{1}$$

The normalized face region, the rectangle from $(x_l, y_l)$ to $(x_u, y_u)$, is
defined as

$$x_l = r_x - \frac{1}{3} d, \tag{2}$$
$$x_u = l_x + \frac{1}{3} d, \tag{3}$$
$$y_u = \frac{r_y + l_y}{2} + \frac{1}{2} d, \tag{4}$$
$$y_l = y_u - \frac{5}{3} d. \tag{5}$$

Face images are then normalized to 64 x 64 pixels for face similarity evaluation.


4.2  Face  Similarity 

We used an eigenvector-based method to compute the distance between two faces,
based on the well-known eigenface method [4]. Using this distance, we defined a
similarity between two faces.

Let face images be two-dimensional $N$ by $N$ arrays of intensity values. These
images can also be regarded as vectors of dimension $N^2$. Let the training set of
$M$ face images be $F_1, F_2, \ldots, F_M$. We can define the covariance matrix $C$
as follows:

$$\bar{F} = \frac{1}{M} \sum_{i=1}^{M} F_i, \tag{6}$$
$$\tilde{F}_i = F_i - \bar{F}, \tag{7}$$
$$C = \frac{1}{M} \sum_{i=1}^{M} \tilde{F}_i \tilde{F}_i^T = \frac{1}{M} A A^T, \tag{8}$$

where $A = [\tilde{F}_1 \tilde{F}_2 \cdots \tilde{F}_M]$. The eigenvectors $u_k$
and corresponding eigenvalues $\lambda_k$
($\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$) can be derived from $C$. We
call the eigenvectors $u_k$ the eigenfaces.

Assume that an unknown face image $F$ is given. Consider the $M'$ ($< M$)
eigenfaces having the $M'$ largest eigenvalues. These span an $M'$-dimensional
eigenface subspace, into which $F$ is converted as

$$\Phi = [\phi_1 \phi_2 \cdots \phi_{M'}]^T, \tag{9}$$

where

$$\phi_k = u_k^T (F - \bar{F}), \quad k = 1, \ldots, M'. \tag{10}$$

The distance $d_{ij}$ between two face images indexed by $i$ and $j$ can be derived
from the corresponding points $\Phi_i$ and $\Phi_j$ within the eigenface subspace
as follows:

$$d_{ij}^2 = \|\Phi_i - \Phi_j\|^2. \tag{11}$$

The similarity between them is then defined as

$$S(F_i, F_j; \sigma_f) = e^{-\frac{d_{ij}^2}{2\sigma_f^2}}, \tag{12}$$

where $\sigma_f$ is the standard deviation of a Gaussian filter in the eigenface
subspace, which is determined empirically. We also have to determine the dimension
$M'$ of the eigenface subspace; we used $M' = 16$ for our experimental Name-It
embodiment.
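
The projection of Eqs. (9)-(10) and the similarity of Eqs. (11)-(12) can be sketched in a few lines. The sketch assumes the mean face and the eigenfaces have already been computed from a training set; the function names, the four-pixel "images", and the two hand-picked orthonormal eigenfaces are purely illustrative assumptions, not data from the experiments.

```python
import math


def project(face, mean_face, eigenfaces):
    """Map a face vector to eigenface coefficients: phi_k = u_k . (F - mean)."""
    centered = [x - m for x, m in zip(face, mean_face)]
    return [sum(u * c for u, c in zip(eigenface, centered))
            for eigenface in eigenfaces]


def similarity(phi_i, phi_j, sigma_f):
    """Gaussian similarity of two points in the eigenface subspace."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(phi_i, phi_j))
    return math.exp(-dist_sq / (2.0 * sigma_f ** 2))


# Toy data (assumed): 4-pixel faces and M' = 2 orthonormal eigenfaces.
mean_face = [0.5, 0.5, 0.5, 0.5]
eigenfaces = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
phi_a = project([1.0, 0.5, 0.2, 0.9], mean_face, eigenfaces)
phi_b = project([0.5, 0.5, 0.7, 0.1], mean_face, eigenfaces)
s_ab = similarity(phi_a, phi_b, sigma_f=1.0)
```

Identical points give a similarity of 1, and the value decays toward 0 as the points move apart, at a rate set by the empirically chosen standard deviation.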

4.3  Name  Candidates  Extraction 

Along with face extraction, we extract proper nouns, which we use as name
candidates, from news transcripts. Although we use closed-caption text as
transcripts, transcripts could also be extracted from sound tracks using a suitable
speech recognition system. Closed-caption text consists of lines, each of which
carries time code information specifying when the line is shown on the screen.

The system eliminates special characters and numerals from the text, then applies
an English dictionary to separate out proper nouns. We use the Oxford Text Archive
text710 dictionary, which is freely distributed and includes more than 70,000
words. A given word is assumed to be a name (or a proper noun) if (1) the word is
annotated as a proper noun in the dictionary, or (2) the word does not appear in
the dictionary.
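
The two dictionary rules can be illustrated with a short sketch. The dictionary format used here (a mapping from lowercase words to part-of-speech tags), the tag name "proper", and the tiny word list are hypothetical stand-ins for the real dictionary file, whose format is not described in this section.

```python
import re


def extract_name_candidates(transcript, dictionary):
    """Return name candidates in order of first appearance.

    `dictionary` maps a lowercase word to a set of part-of-speech tags
    (hypothetical format; "proper" marks proper nouns).
    """
    # Keeping only alphabetic runs drops numerals and special characters.
    words = re.findall(r"[A-Za-z]+", transcript)
    candidates = []
    for word in words:
        tags = dictionary.get(word.lower())
        # Rule (2): unknown word; rule (1): annotated as a proper noun.
        is_name = tags is None or "proper" in tags
        if is_name and word.upper() not in candidates:
            candidates.append(word.upper())
    return candidates


toy_dictionary = {
    "taking": {"verb"}, "risks": {"noun"}, "for": {"prep"},
    "peace": {"noun"}, "is": {"verb"}, "a": {"det"}, "theme": {"noun"},
    "president": {"noun"}, "said": {"verb"}, "should": {"verb"},
    "apply": {"verb"}, "from": {"prep"}, "to": {"prep"},
    "bosnia": {"proper"},
}
transcript = ("TAKING RISKS FOR PEACE IS A THEME PRESIDENT CLINTON SAID "
              "SHOULD APPLY FROM BOSNIA TO BELFAST.")
names = extract_name_candidates(transcript, toy_dictionary)
```

Here BOSNIA is accepted by rule (1), while CLINTON and BELFAST are accepted by rule (2) because the toy dictionary does not contain them. Rule (2) also admits any misspelled or rare common word, which is one reason the output is only a set of candidates.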

Figure 6 depicts the diagram of name candidate extraction, an example of a
transcript, and the extracted proper nouns.




Closed-Caption  Sound  Track 


Transcript

TAKING RISKS FOR PEACE IS A

THEME  PRESIDENT  CLINTON  SAID 
SHOULD  APPLY  FROM  BOSNIA  TO 
BELFAST. 

THOSE  SENTIMENTS  FOUND  A 
RECEPTIVE  AUDIENCE  IN  NORTHERN 
IRELAND  TODAY,  WHERE  MR.  CLINTON, 
THE  FIRST  AMERICAN 
PRESIDENT  TO  VISIT  THE  NORTH, 


(Name)

BOSNIA

BELFAST 

IRELAND 

CLINTON 


Figure  6:  Name  Candidates  Extraction 
Fa:


Co-occurrence  C(Face,  Name) 

C(Fa,  CLINTON)  >  C(Fa,  GINGRICH),  C(Fa,  DOLE), .... 

C(Fa,  CLINTON)  >  C(Fb,  CLINTON),  C(Fc,  CLINTON), .... 

Figure  7:  An  Example  of  Face  and  Name  Co-occurrence 

5  Face  and  Name  Co-occurrence 

5.1  Overview  of  the  Method 

To achieve face and name association (e.g., to obtain the associated faces for a
given name, or vice versa), we introduce a face and name co-occurrence factor
$C(N, F)$ of a face $F$ and a name $N$. To acquire the co-occurrence, we first
inspect the occurrences of extracted names within transcripts and the occurrences
of extracted faces within videos along the same time scale. The co-occurrence is
then calculated to represent how well the name and the face “co-occur” in time,
i.e., the face $F$ tends to appear in the video when the name $N$ appears in the
transcript and vice versa, while $F$ tends not to appear when $N$ does not appear.
The factor $C(N, F)$ is expected to take a larger value when the face $F$ tends to
have the name $N$. For example, consider faces $F_a, F_b, \ldots$ and names
Clinton, Gingrich, ... as shown in Figure 7, where $F_a$ corresponds to Clinton and
$F_b$ to Gingrich. In this case, the co-occurrence $C(F_a, \text{Clinton})$ takes a
larger value than the co-occurrence between $F_a$ and any other name, or between
any other face and “Clinton.”

Once the co-occurrence is defined, face-to-name or name-to-face association can be
obtained in a straightforward manner. Consider the situation where a face is given
and the associated names are desired. We compute the co-occurrence between the face
and every name candidate, then sort the names by co-occurrence to obtain the names
with the top-N largest co-occurrence values. These names may be good estimates of
the name of the face. Name-to-face association can be achieved similarly.

In this section, occurrences of names and faces are defined as functions over time.
We call these occurrence functions, and use them to define the co-occurrence factor.

5.2  Occurrence  Functions 

Assume that all occurrences of a word $N$ are extracted from a given transcript.
Let the occurrences of $N$ be $t_{N,1}, t_{N,2}, \ldots$, i.e., the word $N$ occurs
at times $t_{N,1}, t_{N,2}, \ldots$ in the video. The name occurrence function is
defined as

$$O_{name}(t; N) = \sum_i \delta(t - t_{N,i}), \tag{13}$$

where $\delta(t)$ is the Dirac delta function.

Figure 8: Occurrence Function

Let $F_1, F_2, \ldots$ be a set of faces. We can evaluate the face similarity
$S(F_i, F_j)$ as given in Section 4.2, e.g., $S(F_i, F_j) > S(F_i, F_k)$ if $F_i$
is more similar to $F_j$ than to $F_k$. Assume that the faces $F_1, F_2, \ldots$
occur at $t_{F_1}, t_{F_2}, \ldots$ within the video footage. The face occurrence
function of a face $F$ is defined as

$$O_{face}(t; F) = \sum_i S(F, F_i)\, \delta(t - t_{F_i}). \tag{14}$$

The name and face occurrence functions are dispersed using a Gaussian filter:

$$O'_{name}(t; N) = O_{name}(t; N) \otimes e^{-\frac{t^2}{2\sigma_t^2}}, \tag{15}$$
$$O'_{face}(t; F) = O_{face}(t; F) \otimes e^{-\frac{t^2}{2\sigma_t^2}}, \tag{16}$$

where $\sigma_t$ is the standard deviation of the temporal Gaussian filter, and
$\otimes$ represents the convolution operator. Intuitively, $O'_{name}(t; N)$
represents the likelihood of occurrence of the word $N$, i.e., if
$O'_{name}(t_1; N) > O'_{name}(t_2; N)$, the video at $t_1$ is more likely to be
about $N$ than at $t_2$. Similarly, $O'_{face}(t; F)$ represents the likelihood of
occurrence of the face $F$, i.e., if $O'_{face}(t_1; F) > O'_{face}(t_2; F)$, the
video at $t_1$ is more likely to be about $F$ than at $t_2$. Figure 8 illustrates
the relationship between occurrences of names and faces and the occurrence
functions.
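
Because convolving a Dirac delta with a Gaussian simply places a copy of the Gaussian at the occurrence time, the dispersed occurrence functions can be evaluated in closed form as sums of Gaussians. The sketch below assumes the unnormalized Gaussian kernel written above; the function names and the sample times are illustrative assumptions.

```python
import math


def dispersed_name_occurrence(t, name_times, sigma_t):
    """O'_name(t; N): one Gaussian bump per occurrence time of the name."""
    return sum(math.exp(-(t - tn) ** 2 / (2.0 * sigma_t ** 2))
               for tn in name_times)


def dispersed_face_occurrence(t, face_times, similarities, sigma_t):
    """O'_face(t; F): Gaussian bumps at the face times, each weighted by the
    similarity S(F, F_i) between the query face and the face seen there."""
    return sum(s * math.exp(-(t - tf) ** 2 / (2.0 * sigma_t ** 2))
               for tf, s in zip(face_times, similarities))


# Toy timeline in seconds (assumed values).
name_times = [10.0, 14.0]
near = dispersed_name_occurrence(12.0, name_times, sigma_t=2.0)
far = dispersed_name_occurrence(40.0, name_times, sigma_t=2.0)
face_value = dispersed_face_occurrence(10.0, [10.0], [0.8], sigma_t=2.0)
```

A time between the two mentions scores much higher than a time far from both, which matches the reading of the dispersed function as a likelihood that the video is about the name at that time.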


5.3  Co-occurrence 


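From the name and face occurrence functions defined above, we define the name-face
co-occurrence $C(N, F)$:

$$C(N, F) = \frac{\left( \int_{D_t} O'_{name}(t; N) \cdot O_{face}(t; F)\, dt \right)^p}{\int_{D_t} O'_{name}(t; N)\, dt \int_{D_t} O_{face}(t; F)\, dt} \tag{17}$$
$$= \frac{\left( \int_{D_t} O'_{name}(t; N) \cdot O_{face}(t; F)\, dt \right)^p}{\int_{D_t} O_{name}(t; N)\, dt \int_{D_t} O'_{face}(t; F)\, dt}, \tag{18}$$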

where $p$ is an empirically determined constant ($p > 1$), and $D_t$ is the time
domain of the video. When the peaks of $O'_{name}(t; N)$ and $O_{face}(t; F)$
overlap, the numerator increases, but when they are offset, their product is near
zero and the numerator is small (see Figure 8). Suppose that $O_{face}(t; F) = 1$
almost everywhere, like white noise. This is the case of an anchor person’s face,
which may coincide with almost any name, i.e., the face occurs at any time without
any relation to the occurrences of any name. The numerator obtained with such a
white-noise face occurrence and an arbitrary $O'_{name}(t; N)$ does not differ from
the numerator obtained with an $O_{face}(t; F)$ that equals 1 only where
$O'_{name}(t; N)$ is nonzero, i.e., where the name occurs. An ideal co-occurrence
should be small if $O_{face}(t; F)$ has a large value where $O'_{name}(t; N)$ is
small. Thus we normalize the numerator by the sizes of $O_{name}(t; N)$ and
$O_{face}(t; F)$. To understand why the constant $p$ should be greater than 1,
consider the case where $O'_{name}(t; N)$ has a value larger than zero at $t_0$ and
$O_{face}(t; F) = k\delta(t - t_0)$ ($k > 0$). If $p = 1$, the value of $k$ has no
effect on $C(N, F)$, whereas ideally the co-occurrence should increase as $k$
becomes larger. To accomplish this, we choose a value for $p$ greater than 1. Its
magnitude is nevertheless constrained, since a very large value of $p$ would undo
the normalization. In the experimental system described later, we used $p = 1.7$,
although the system worked fine for $p$ between 1.5 and 2.0.
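
The role of the exponent can be checked numerically. The sketch below assumes that the numerator integrates the dispersed name occurrence against the undispersed face occurrence (as in the derivation of Section 6.1), and that the denominator is the product of the integral of the name occurrence (the number of occurrences of the name) and the integral of the dispersed face occurrence. The function name and all sample values are illustrative assumptions.

```python
import math


def co_occurrence(name_times, face_times, similarities, sigma_t, p=1.7):
    """Closed-form co-occurrence for delta-train occurrence functions."""
    # Numerator integral: each face occurrence samples the dispersed
    # name occurrence at its own time, weighted by its similarity.
    overlap = sum(
        s * sum(math.exp(-(tf - tn) ** 2 / (2.0 * sigma_t ** 2))
                for tn in name_times)
        for tf, s in zip(face_times, similarities))
    # Denominator: num(N) times sqrt(2*pi)*sigma_t times sum of similarities.
    norm = (len(name_times) * math.sqrt(2.0 * math.pi) * sigma_t
            * sum(similarities))
    return overlap ** p / norm if norm > 0 else 0.0


name_times = [10.0, 12.0]
aligned = co_occurrence(name_times, [11.0], [1.0], sigma_t=2.0)
offset = co_occurrence(name_times, [100.0], [1.0], sigma_t=2.0)

# Effect of the face weight k (here k = 3) for p = 1 and p = 1.7.
flat_1 = co_occurrence(name_times, [11.0], [1.0], sigma_t=2.0, p=1.0)
flat_3 = co_occurrence(name_times, [11.0], [3.0], sigma_t=2.0, p=1.0)
boosted_3 = co_occurrence(name_times, [11.0], [3.0], sigma_t=2.0, p=1.7)
```

With p = 1 the weight cancels between numerator and denominator, so flat_1 and flat_3 coincide; with p = 1.7 the stronger face occurrence yields a larger value, which is the behavior the text asks for.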


6  Face-Name  Association  Method 

The basic method for face-name association is simple. To associate a given face
$F$ with names, we calculate the co-occurrence factor $C(N, F)$ for every name
candidate, sort the names by this factor, and return the top-N names as the result.
The association of a given name with faces can be obtained in the same way.
Although this is simple, obtaining a co-occurrence factor requires significant
computation, and many co-occurrence factors are needed to obtain an association
result. It may thus require impractical computation time. Precomputing the results
for possible queries may overcome this drawback, particularly for name-to-face
associations, since a finite set of name candidates is given for the target videos.
However, face-to-name association cannot be precomputed, because a given face is
unknown beforehand and thus not restricted to any finite set. To cope with this
problem, we introduce a more efficient algorithm. First we convert the name
occurrence function over time into a name occurrence function over face space, and
show how $C(N, F)$ can be expressed in terms of it. Then we describe an efficient
algorithm to compute co-occurrence factors.


6.1  Conversion  of  the  Name  Occurrence  Function 


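Assume $f$ is a variable point within the eigenface subspace
($f \in \mathcal{F} = \mathbb{R}^{M'}$). Let $O_{name}(f; N)$ be the name
occurrence function over the eigenface subspace:

$$O_{name}(f; N) = \sum_i h(f - f(F_i))\, O'_{name}(t_{F_i}; N), \tag{19}$$

where $h(f) = \delta(\|f\|^2)$, and $f(F)$ gives the point within the eigenface
subspace corresponding to the face $F$. The name occurrence function over the
eigenface subspace is then dispersed using a Gaussian filter over the eigenface
subspace:

$$O'_{name}(f; N) = O_{name}(f; N) \otimes e^{-\frac{\|f\|^2}{2\sigma_{face}^2}} \tag{20}$$
$$= \sum_i \sum_j e^{-\frac{(t_{F_i} - t_{N,j})^2}{2\sigma_t^2}}\, h(f - f(F_i)) \otimes e^{-\frac{\|f\|^2}{2\sigma_{face}^2}} \tag{21}$$
$$= \sum_i \sum_j e^{-\frac{(t_{F_i} - t_{N,j})^2}{2\sigma_t^2}}\, e^{-\frac{\|f - f(F_i)\|^2}{2\sigma_{face}^2}}. \tag{22}$$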

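Next we demonstrate an equivalence between the numerator of the co-occurrence
factor (18) and $O'_{name}(f(F); N)$:

$$\int_{D_t} O'_{name}(t; N) \cdot O_{face}(t; F)\, dt \tag{23}$$
$$= \int_{D_t} \sum_j e^{-\frac{(t - t_{N,j})^2}{2\sigma_t^2}} \cdot \sum_i e^{-\frac{\|f(F_i) - f(F)\|^2}{2\sigma_{face}^2}}\, \delta(t - t_{F_i})\, dt \tag{24}$$
$$= \sum_{i,j} e^{-\frac{(t_{F_i} - t_{N,j})^2}{2\sigma_t^2}}\, e^{-\frac{\|f(F_i) - f(F)\|^2}{2\sigma_{face}^2}} \tag{25}$$
$$= O'_{name}(f(F); N). \tag{26}$$

Therefore the co-occurrence factor is given as

$$C(N, F) = \frac{\left( O'_{name}(f(F); N) \right)^p}{\int_{D_t} O_{name}(t; N)\, dt \int_{D_t} O'_{face}(t; F)\, dt}. \tag{27}$$
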

Figure 9 shows how the name occurrence function over the eigenface subspace is
composed from the face and name occurrence functions.

Figure 9: Conversion of Name Occurrence Function

6.2  Face-Name  Association  Algorithm 

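The name occurrence function over the eigenface subspace, $O'_{name}(f; N)$, can be
expressed as follows:

$$A_i(f) = e^{-\frac{\|f - f(F_i)\|^2}{2\sigma_{face}^2}}, \tag{28}$$
$$B^N_{i,j} = e^{-\frac{(t_{F_i} - t_{N,j})^2}{2\sigma_t^2}}, \tag{29}$$
$$O'_{name}(f; N) = \sum_i \Big( A_i(f) \sum_j B^N_{i,j} \Big). \tag{30}$$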

Assume that a face $F$ ($f = f(F)$) is given and its associated names are to be
provided. It is noteworthy that $\sum_j B^N_{i,j}$ can be precomputed for each name
candidate $N$ in this situation, and this precomputation greatly reduces the
computation time. We precompute $\sum_j B^N_{i,j}$ for each name candidate $N$ and
each reference face $F_i$, and prepare slots to store these sums, numbering the
number of name candidates times the number of reference faces. In general the
number of name candidates is several hundred and the number of reference faces is
several thousand; the slots therefore number on the order of a million, and it is
practical to keep them in secondary storage.

To obtain the associated names of a given face, $A_i(f)$ must be computed on
demand. While computing $O'_{name}(f; N)$ requires as many iterations as there are
reference faces, most of the terms are negligible due to the Gaussian filtering,
i.e.,

$$O'_{name}(f; N) = \sum_i \Big( A_i(f) \sum_j B^N_{i,j} \Big) \tag{31}$$
$$\approx \sum_{i \in \{ i \,\mid\, \|f - f(F_i)\| < k\sigma_{face} \}} \Big( A_i(f) \sum_j B^N_{i,j} \Big), \tag{32}$$


where $k$ is a constant that controls the reduction of iterations. Selecting the
$A_i(f)$ that satisfy $\|f - f(F_i)\| < k\sigma_{face}$ requires retrieving the
points adjacent to a given point in multidimensional space, since $f$ and $f(F_i)$
are $M'$-dimensional points. To achieve this retrieval, spatial data structures
such as the R-tree [8] can be used. Roughly speaking, these methods build tree
structures from $M$ given points or (hyper-)rectangles in multidimensional space
and provide spatial retrieval, i.e., they enumerate all data that overlap a given
(hyper-)rectangle, with $O(\log M)$ computation.
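
The split into a precomputable part (the sums over name occurrences) and an on-demand part (the Gaussian weights around the query face) can be sketched as follows. Note the substitution: a plain linear scan with a distance test stands in for the R-tree, so the sketch shows what is pruned but not the logarithmic retrieval cost. Function names and sample values are assumptions.

```python
import math


def precompute_b_sums(face_times, name_times, sigma_t):
    """For one name N, return sum_j B^N_{i,j} for every reference face i."""
    return [
        sum(math.exp(-(tf - tn) ** 2 / (2.0 * sigma_t ** 2))
            for tn in name_times)
        for tf in face_times
    ]


def name_occurrence_in_face_space(f, face_points, b_sums, sigma_face, k=None):
    """Evaluate O'_name(f; N); if k is given, skip reference faces farther
    than k * sigma_face from f (linear scan standing in for an R-tree)."""
    radius_sq = None if k is None else (k * sigma_face) ** 2
    total = 0.0
    for point, b_sum in zip(face_points, b_sums):
        dist_sq = sum((a - c) ** 2 for a, c in zip(f, point))
        if radius_sq is not None and dist_sq >= radius_sq:
            continue  # negligible Gaussian weight, pruned
        total += math.exp(-dist_sq / (2.0 * sigma_face ** 2)) * b_sum
    return total


# Toy reference faces in a 2-dimensional subspace (assumed values).
face_points = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0]]
face_times = [5.0, 6.0, 7.0]
b_sums = precompute_b_sums(face_times, name_times=[5.5, 6.5], sigma_t=2.0)
query = [0.0, 0.0]
exact = name_occurrence_in_face_space(query, face_points, b_sums, 1.0)
pruned = name_occurrence_in_face_space(query, face_points, b_sums, 1.0, k=3.0)
```

The distant third face is skipped in the pruned evaluation, yet the result agrees with the exact sum to within floating-point precision, because its Gaussian weight is vanishingly small.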

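To compute the co-occurrence factor $C(N, F)$, we need to obtain the denominator of
Eq. (27), i.e.,

$$\int_{D_t} O_{name}(t; N)\, dt = \mathrm{num}(N), \tag{33}$$
$$\int_{D_t} O'_{face}(t; F)\, dt = \sqrt{2\pi}\, \sigma_t \int_{D_t} O_{face}(t; F)\, dt, \tag{34}$$
$$\int_{D_t} O_{face}(t; F)\, dt = \sum_i S(F, F_i) \tag{35}$$
$$= \sum_i e^{-\frac{\|f(F) - f(F_i)\|^2}{2\sigma_{face}^2}}, \tag{36}$$
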


(36) 


where $\mathrm{num}(N)$ is the number of occurrences of the word $N$ in the videos,
which can be precomputed. Note that the given face $F$ is fixed while the name
candidate varies. In other words, Eq. (35) makes no contribution to the differences
among the co-occurrence factors of the various name candidates. It can therefore be
omitted when evaluating the co-occurrence factor to select the associated names of
the face.

Finally, to obtain the associated names of the face, we need to evaluate the
co-occurrence factor $C(N, F)$ for each name candidate $N$. In total, this requires
evaluating the co-occurrence as many times as there are candidate words ($n_N$),
and each evaluation requires as many iterations as there are faces ($n_F$). The
spatial data structure may drastically reduce the number of iterations, at a cost
of only $O(\log n_F)$ computation. This makes the computation sufficiently fast,
even for interactive systems.
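
Putting the pieces together, face-to-name association amounts to scoring every name candidate for one query point and sorting. In the sketch below the score drops the face-only factor of the denominator, since it is the same for every name, and divides only by the number of occurrences of the name. The data layout (a dictionary of precomputed sums per name), the linear-scan neighbor selection, and all sample values are assumptions.

```python
import math


def rank_names(f, face_points, b_sums_by_name, name_counts,
               sigma_face, p=1.7, k=3.0, top_n=3):
    """Return the top_n (name, score) pairs for the query point f."""
    # On-demand part: Gaussian weights A_i(f) of nearby reference faces.
    weights = {}
    for i, point in enumerate(face_points):
        dist_sq = sum((a - c) ** 2 for a, c in zip(f, point))
        if dist_sq < (k * sigma_face) ** 2:
            weights[i] = math.exp(-dist_sq / (2.0 * sigma_face ** 2))
    # Precomputed part: b_sums_by_name[name][i] holds sum_j B^N_{i,j}.
    scored = []
    for name, b_sums in b_sums_by_name.items():
        occurrence = sum(w * b_sums[i] for i, w in weights.items())
        scored.append((occurrence ** p / name_counts[name], name))
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_n]]


# Toy setup (assumed): two reference faces and two name candidates.
face_points = [[0.1, 0.0], [5.0, 5.0]]
b_sums_by_name = {"CLINTON": [1.8, 0.1], "GINGRICH": [0.1, 1.5]}
name_counts = {"CLINTON": 2, "GINGRICH": 2}
near_first = rank_names([0.0, 0.0], face_points, b_sums_by_name,
                        name_counts, sigma_face=1.0)
near_second = rank_names([5.0, 5.0], face_points, b_sums_by_name,
                         name_counts, sigma_face=1.0)
```

A query near the first reference face ranks CLINTON first, and a query near the second ranks GINGRICH first, because only the neighboring reference face contributes its precomputed sums.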

Next, consider obtaining the associated faces for a given name $N$. The basic idea
is to select the faces $F$ whose co-occurrence factor $C(N, F)$ has large values.
The resulting faces are selected from the reference face set
$\{F_1, F_2, \ldots\}$. The system computes $C(N, F_i)$ for each reference face
$F_i$, sorts the co-occurrences, and selects the $F_i$ having the top-N
co-occurrence values. The co-occurrence of Eq. (27) is undefined for a word that is
not included in the set of name candidates obtained from the videos, so it is
reasonable to restrict a given word to be one of the predefined name candidates.
Since the number of name candidates is finite and not very large (around several
hundred), it is quite practical to precompute the associated faces for each name
candidate. In addition, the method for computing the co-occurrences $C(N, F)$
explained above can be applied to make this precomputation efficient.
Who is he? (top 3 name candidates with co-occurrence factors)

Input face                              1.               2.                3.
(BILL MILLER, singer)                   MILLER 30.4      NASHVILLE 11.17   DON 4.92
(GARTH BROOKS, singer)                  AEROSMITH 3.19   GARTH 2.37        GUY 2.33
(GERRY ADAMS, leader of SINN FEIN)      FEIN 1.47        SINN 1.45         AUTHER 0.66
(RON ERHARDT, Pittsburgh STEELERS)      ERHARDT 5.01     STEELERS 4.91     BEHRING 2.36

Who is "KELLY"? (top 3 faces with co-occurrence factors)

(GENE KELLY)   22.52   21.50   18.83

KELLY MADE HIS FILM DEBUT
WITH JUDY GARLAND IN 1942.
BUT HE'S BEST KNOWN
FOR THE CLASSIC "SINGIN'
IN THE RAIN."

Figure 10: Face-Name Association Results

7  Experimental  Results

We implemented the described algorithm on an SGI Indigo2 workstation (MIPS
R4400 200MHz). Since it is an experimental embodiment, we have not yet implemented
either the search acceleration using a spatial data structure for face-to-name
association or the precomputation for name-to-face association. The system was
applied to nine CNN Headline News videos (30 minutes each), 4.5 hours of news video
in total. From them, 4,318 scene changes were detected and 320 faces were
extracted. The system was also given 251 candidate names, which were extracted from
all the transcripts.

Figure 10 shows several results of face and name association. The upper row shows
face-to-name associations. Each image was given to the system as an input face, and
a list of name candidates was output with the corresponding co-occurrence factors.
The top 3 name candidates are shown for each face image. The leftmost face is a
singer, Bill Miller, and the system output “MILLER” as the first candidate. The
next face is also a singer, Garth Brooks, for whom “GARTH” was the second
candidate; the first candidate, “AEROSMITH,” was said by the reporter many times
because the reporter introduced Brooks singing an Aerosmith song. The next face is
a leader of Sinn Fein, Gerry Adams. Since “SINN,” “FEIN,” “GERRY,” and “ADAMS” were
all marked as proper nouns, and “SINN” and “FEIN” are spoken frequently, the face
matched well with “SINN” and “FEIN.” The rightmost face is the former offensive
coordinator of the NFL’s Pittsburgh Steelers, Ron Erhardt. The system successfully
output “ERHARDT” as the first candidate, as well as “STEELERS” as the second
candidate.

The lower row shows the results of name-to-face association. The name “KELLY”
was given to the system, and the top 3 face candidates the system generated are shown with
their co-occurrence factors. Those faces are all of Gene Kelly, a movie actor. Typically, it
takes about 7 sec. to obtain face-to-name association results and about 2 sec. to obtain
name-to-face association results.

These examples are successful results, mostly because the persons were introduced in
special topics about them. In those topics, their names are spoken repeatedly by
anchors and reporters, and their footage is shown with close-up shots.
In such cases, the method extracted meaningful face and name associations. But
there are some cases where the method is not applicable. A typical case is footage of
the current American president, “Mr. Clinton.” Mr. Clinton is reported on
almost every day, several times a day; however, at least in CNN Headline News, most
reports are given by anchor persons, only in their narration and without any footage
of Clinton. Since the method neither discriminates anchor persons nor recognizes
topics that lack footage of the person of interest, it thus far cannot associate
Clinton’s face and name. The method gives a lower co-occurrence factor to anchors,
whose faces coincide with many names, as described in Section 5.3, and thus it
does not ordinarily need to give special treatment to anchors. However, it is necessary to
discriminate anchors to cope with this “Mr. Clinton” case. In addition, much higher-level
natural language processing techniques and knowledge of news video structure may
be needed.


8  Conclusions 

This paper describes a face and name association method for videos. The method is
provided with news videos showing persons of interest; then, given a name, it
returns associated faces, or, given a face, it returns associated names. The successful
results demonstrate that face and name association provides useful video content
information, and thus Name-It contributes to automated video indexing. In addition, the
successful results of Name-It reveal that our methodology of integrated use of image
understanding techniques and natural language processing techniques is headed in
the right direction to achieve our goal of accessing the real contents of multimedia
information. Because the method is still not fully robust, i.e., there are some cases the
method cannot analyze, we are improving it by using higher-level natural
language processing, i.e., applying a thesaurus and parsing, as well as higher-level
image processing, i.e., face tracking and extracting face occurrence duration, and
enhancement of the face identification method. This research was done with support
from the Informedia digital video library project, and more than 100 hours of news
videos are available. We would like to apply the method to this archive after it
becomes more robust. This should demonstrate the practical effectiveness of
the method, and as an outcome, we can obtain a face and name association
database covering several hundred people or more.